Installing and Attaching Packages

We will be using librarian for package management as it greatly simplifies the process of installing and attaching packages. It needs to be installed using install.packages() unless already installed.

install.packages('librarian')

The librarian::shelf() function checks whether all listed packages are installed and installs them if needed. Then it loads all listed packages into the environment. You can think of it as a combination of install.packages() and library() except it will not re-install a package if it is already installed. Here are the packages we will be using:

librarian::shelf(readr, janitor, dplyr, magrittr, tidyr, tidyselect, ggplot2, 
                 plotly, gganimate, gifski)

Reading and Cleaning Data

We will be using the following data files from the data directory to investigate the relationship between health and wealth:

All the data are in IEFT RFC 4180 CSV (comma-separated values) format and the first four rows of the World Bank data files contain metadata with the actual data table starting on row five.

Let us start with the population data. CSV data files can be easily read in using the readr::read_csv() function. The function reads the contents of the file into a tibble object (which is basically an advanced data frame) and supports various additional arguments. For example, we can utilize the skip argument to skip the first four rows of the CSV file as the data table does not start until row five.

population <- readr::read_csv(file = 'data/population.csv',
                              skip = 4,
                              show_col_types = FALSE)

Upon file read, the function outputs a summary outlining the column names and types of the data by default. This can be silenced by specifying the show_col_types=FALSE when calling the function.

Now the World Bank population data is stored in a tibble named population. Calling the name of the variable will display the data in a neat interactive table viewer.

population

We see that the tibble appears to have the following columns:

We also see that the table has 266 rows and 66 columns. The number of rows and columns of a tibble or other data-frame-like object can also be programmatically extracted by calling the dim() function.

dim(population)
## [1] 266  66

The list of column names can be programmatically extracted via the names() function.

names(population)
##  [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
##  [5] "1960"           "1961"           "1962"           "1963"          
##  [9] "1964"           "1965"           "1966"           "1967"          
## [13] "1968"           "1969"           "1970"           "1971"          
## [17] "1972"           "1973"           "1974"           "1975"          
## [21] "1976"           "1977"           "1978"           "1979"          
## [25] "1980"           "1981"           "1982"           "1983"          
## [29] "1984"           "1985"           "1986"           "1987"          
## [33] "1988"           "1989"           "1990"           "1991"          
## [37] "1992"           "1993"           "1994"           "1995"          
## [41] "1996"           "1997"           "1998"           "1999"          
## [45] "2000"           "2001"           "2002"           "2003"          
## [49] "2004"           "2005"           "2006"           "2007"          
## [53] "2008"           "2009"           "2010"           "2011"          
## [57] "2012"           "2013"           "2014"           "2015"          
## [61] "2016"           "2017"           "2018"           "2019"          
## [65] "2020"           "2021"

Note how the first four column names contain a space and the rest of the column names are all numbers. This is bad practice as various functions do not work well with column names that either contain a space or start with a number. It is common to replace all spaces with either periods or underscores and add a letter prefix to any column names consisting solely of numbers.

This can be easily achieved by using the janitor::clean_names() function, which takes the original table as input and outputs a new table where the column names have been properly formatted.

population <- janitor::clean_names(population)
names(population)
##  [1] "country_name"   "country_code"   "indicator_name" "indicator_code"
##  [5] "x1960"          "x1961"          "x1962"          "x1963"         
##  [9] "x1964"          "x1965"          "x1966"          "x1967"         
## [13] "x1968"          "x1969"          "x1970"          "x1971"         
## [17] "x1972"          "x1973"          "x1974"          "x1975"         
## [21] "x1976"          "x1977"          "x1978"          "x1979"         
## [25] "x1980"          "x1981"          "x1982"          "x1983"         
## [29] "x1984"          "x1985"          "x1986"          "x1987"         
## [33] "x1988"          "x1989"          "x1990"          "x1991"         
## [37] "x1992"          "x1993"          "x1994"          "x1995"         
## [41] "x1996"          "x1997"          "x1998"          "x1999"         
## [45] "x2000"          "x2001"          "x2002"          "x2003"         
## [49] "x2004"          "x2005"          "x2006"          "x2007"         
## [53] "x2008"          "x2009"          "x2010"          "x2011"         
## [57] "x2012"          "x2013"          "x2014"          "x2015"         
## [61] "x2016"          "x2017"          "x2018"          "x2019"         
## [65] "x2020"          "x2021"

We know that the population table stores population values, so the columns indicator_name and indicator_code are redundant. These can be dropped using the dplyr::select() function. The first argument to the function is the table variable itself and following arguments specify which columns to keep or drop. For example, explicitly specifying a column name will result in only the listed column being selected and adding a minus sign in front will reverse the selection, meaning any specified column name preceded with a minus sign will cause it to be dropped and all remaining columns to be selected.

The output is a new table with the specified selection applied. There are numerous different ways one can select columns using dplyr::select(). For example, one could use a tidyselect function to select a set of columns based on a criterion. Refer to the function documentation for more information.

population <- dplyr::select(population, -indicator_name, -indicator_code)

We can validate that the desired columns have been removed by either listing the column names via the names() function again or taking a quick peek at the table using the head() function. It displays the first six rows of the table by default, but a this can be changed by specifying the n argument.

head(population)

Knowing that the World Bank GDP data follows the exact same format as the World Bank population data , we can read the CSV file, clean the column names, and drop the unneeded columns all in one go by combining the readr::read_csv(), janitor::clean_names(), and dplyr::select() functions using the forward pipe operator %>% from the magrittr package. The %>% operator takes either a variable or the output of a function and feeds it into the following function as its first argument.

gdp <- readr::read_csv(file = 'data/gdp.csv',
                       skip = 4, 
                       show_col_types = FALSE) %>%
    janitor::clean_names() %>%
    dplyr::select(-indicator_name, -indicator_code)

In the pipeline above, the tibble outputted by the readr::read_csv() function gets piped into the janitor::clean_names() function, the output of which gets passed into the dplyr::select() function. The output of the final function is saved into the variable gdp. The head() function can be used to take a quick look at the new table and ensure the constructed pipeline worked as expected.

head(gdp)

Long vs Wide Data

GDP on its own is not a good indicator of the wealth of a country as countries with more people tend to have higher GDP. But if we were to normalize GDP by population, then the resulting GDP per capita values can be compared across countries and used as a proxy for wealth. To do so, we must be able to match up the GDP and population values for each unique combination of country and year.

The GDP and population tables are currently in wide format – each row represents a unique country and each column represents a unique year with the cell values representing GDP or population estimates. While this wide format has many advantages and is commonly used in geospatial applications, it does complicate joining various data sets. One option would be to treat both tables as matrices and calculate GDP per capita by dividing the GDP matrix with the population matrix. However, both tables need to have the exact same layout with the same number of countries and years in the same exact order for this to work and the result to be reliable. Ensuring this is not a trivial task, so this method would involve a lot of work to produce reliable results.

Alternatively the two tables could be joined by country. Then we will have an extra-wide table with two sets of year columns – one set of year columns for population and another set of year columns for GDP. Then we would need to create another new column for each year by dividing the corresponding GDP column with the corresponding population column, resulting in another new set of year columns. As you can see, this approach would quickly lead to a very messy and difficult to manage table and would also involve a lot of work, making it far from preferred.

The easiest option for calculating GDP per capita would involve converting both data tables into a long format, where each row represents a single unique observation (estimation). Instead of having countries in rows and years in columns, each row would instead represent a unique country and year combination. This would allow us to easily combine data on both country and year, ensuring that the population and GDP values for each country-year combination get matched appropriately.

We can use the tidyr::pivot_longer() function to covert wide format tables into long format tables. The following arguments are of interest to us when calling the function:

The columns we would like to convert to long format are the year columns. These are all prefixed with an x character. We can use the tidyselect::start_with() function to select all the columns beginning with an x character and pass those as the cols argument. The column names should be stripped of the preceding x character, so we pass that as the names_prefix argument. The stripped column names represent years and the cell values represent population represent population estimates, so finally we specify names_to = "year" and values_to = "population" as the final arguments.

population_long <- tidyr::pivot_longer(data = population,
                                       cols = tidyselect::starts_with('x'),
                                       names_to = 'year',
                                       names_prefix = 'x',
                                       values_to = 'population')
population_long

Now we have a new long population tibble called population_long, where each row represents a unique country and year combination. But note how the data type of the year column appears to be listed as chr, implying that the data is in textual character format instead of numbers. Let us confirm this by extracting the column using the $ operator and checking its data type via the class() function.

class(population_long$year)
## [1] "character"

The dplyr::mutate() function can be combined with the base as.integer() function to convert the year column back into numeric format. The first argument of the dpylr::mutate() function is the data table and subsequent arguments specify how to create new columns or modify existing columns.

For example, to apply the as.integer() function on the whole year column and then replace the year column with the new values, we would specify year = as.integer(year) as an argument.

population_long <- dplyr::mutate(population_long,
                                 year = as.integer(year))
class(population_long$year)
## [1] "integer"

Note how now the data type of the year column is listed as int or integer, meaning it is numeric.

Now let us convert the GDP table into long format as well. As with the population table, we will first need to use tidyr::pivot_longer() to pivot the table and then as.numeric() with dplyr::mutate() to fix the data type of the year column. These can be combined into a pipeline using the %>% operator.

gdp_long <- gdp %>%
    tidyr::pivot_longer(cols = tidyselect::starts_with('x'),
                        names_to = 'year',
                        names_prefix = 'x',
                        values_to = 'gdp') %>%
    dplyr::mutate(year = as.integer(year))

gdp_long

Joining Tables

Finally we are ready to combine the population and GDP tables. The dplyr package has four different join functions we could utilize:

All the aforementioned functions require at least three arguments:

We would like to join on each unique country and year combination. As spellings of country names might differ between tables, it is good practice to always use the ISO 3166-1 alpha-3 country code or some other analogous unique identifier to distinguish between countries. The country code for each country is determined by an international standard and should not differ between tables, allowing us to reliably join the data. Hence we will specify by = c("country_code", "year") to perform the join on unique country-year combinations use the dplyr::inner_join() function to only keep year-country combinations that are present in both tables.

data <- dplyr::inner_join(x = population_long,
                          y = gdp_long,
                          by = c('country_code', 'year'))
head(data)

Note how now the data are joined but the resulting table has two different country_name columns. That is because this column was present in both of the joined tables and was not omitted before the join. We can use the dplyr::select() to drop one of the columns by adding a minus sign in front of its name and rename the other one using the new_name = old_name convention. Then we can use dplyr::mutate() to add a new gdp_per_capita column by dividing the gdp column with the population column. We can combine these two operations into a pipeline using the %>% operator.

data <- data %>%
    dplyr::select(-country_name.y,
                  country_name = country_name.x) %>%
    dplyr::mutate(gdp_per_capita = gdp / population)

data

Defining Functions

Now we would also like to add life expectancy information to this joined table. Knowing that all World Bank data tables follow the same format, we can easily convert the workflow from before into a function that reads in a World Bank data table, drops unneeded columns, converts it to long format, and ensures the year is in numeric format. That function would only need two inputs – the path of the CSV file and the name of the indicator represented by the data. (This name will be used as the column name for the values column in the long format table.) Let us define this function and use it to read in the World Bank life expectancy table and convert it to long format.

read_world_bank_data <- function(file_path, variable_name) {
    readr::read_csv(file_path,
                    skip = 4,
                    show_col_types = FALSE) %>%
        janitor::clean_names() %>%
        dplyr::select(-indicator_name,
                      -indicator_code) %>%
        tidyr::pivot_longer(cols = starts_with('x'),
                            names_to = 'year',
                            names_prefix = 'x',
                            values_to = variable_name) %>%
        dplyr::mutate(year = as.integer(year)) %>%
        return()
}
life_expectancy <- read_world_bank_data('data/life-expectancy.csv',
                                        'life_expectancy')
head(life_expectancy)

Now we can use a pipeline to drop the redundant country_name column and then join the life expectancy table with the rest of our data. Remember that the output of the last function in a pipeline is specified as the first argument of the following function by default. We can override this by referring to the output of the previous function as . and specifying it elsewhere int the following function call.

data <- life_expectancy %>%
    dplyr::select(-country_name) %>%
    dplyr::inner_join(x = data, 
                      y = .,
                      by = c('country_code', 'year'))
data

Finally we would also like to know which United Nations regional geoscheme the country belongs to. Information on this is available in the United Nations M49 table. We can utilize a pipeline to read in and clean the table all in one go. Note that as this is a standard CSV table, there is no need to skip any rows.

m49 <- readr::read_csv(file = 'data/m49.csv',
                       show_col_types = FALSE) %>%
    janitor::clean_names()

head(m49)

Note how this table contains a lot of information on the various groups and codes assigned to each country. We are only interested in the name of the region the country belongs into and need the ISO 3166-1 alpha-3 code assigned to the country to join the data. Let us use the %>% operator to create a pipeline that selects the desired columns and then performs the join.

data <- m49 %>%
    dplyr::select(region_name,
                  country_code = iso_alpha3_code) %>%
    dplyr::inner_join(data,
                      by = 'country_code') %>%
    dplyr::select(country_name,
                  country_code,
                  region_name,
                  year,
                  tidyselect::everything())
data

Note how we can use dplyr::select() along with tidyselect:everything() to rearrange the order of specified columns.

Row Filtering and Static Line Graphs

data %>%
    dplyr::filter(country_code == 'USA') %>%
    head()
data %>%
    dplyr::filter(country_code == 'USA') %>%
    ggplot2::ggplot(mapping = ggplot2::aes(x = year, y = population)) +
    ggplot2::geom_line()

data %>%
    dplyr::filter(country_code %in% c('USA', 'CAN', 'MEX')) %>%
    ggplot2::ggplot(mapping = ggplot2::aes(x = year,
                                           y = gdp_per_capita,
                                           color = country_name)) +
    ggplot2::geom_line()

Visualizing Distributions and Correlations

data2020 <- data %>%
    dplyr::filter(year == 2020) %>%
    tidyr::drop_na()

head(data2020)
ggplot2::ggplot(data = data2020,
                mapping = ggplot2::aes(x = life_expectancy)) +
    ggplot2::geom_histogram(binwidth = 1)

ggplot2::ggplot(data = data2020, 
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_point()

ggplot2::ggplot(data = data2020, 
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_density2d_filled()

ggplot2::ggplot(data = data2020, 
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_point() +
    ggplot2::scale_x_log10()

ggplot2::ggplot(data = data2020, 
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_density2d_filled() +
    ggplot2::scale_x_log10()

ggplot2::ggplot(data = data2020, 
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy,
                                       color = region_name,
                                       size = population)) +
    ggplot2::geom_point() +
    ggplot2::scale_x_log10()

ggplot2::ggplot(data = data2020, 
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy,
                                       color = region_name,
                                       size = population)) +
    ggplot2::geom_point() +
    ggplot2::scale_x_log10() +
    ggplot2::scale_size(range = c(1,10))

ggplot2::ggplot(data = data2020, 
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_point(mapping = ggplot2::aes(color = region_name,
                                               size = population)) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_size(range = c(1,10)) +
    ggplot2::geom_smooth(formula = y ~ x, method = 'lm')

plot <- ggplot2::ggplot(data = data2020, 
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_point(mapping = ggplot2::aes(color = region_name,
                                               size = population)) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_size(range = c(1,10)) +
    ggplot2::geom_smooth(formula = y ~ x,
                         method = 'lm',
                         se = FALSE,
                         color = 'black',
                         linetype = 'dashed') +
    ggplot2::labs(title = 'Health and Wealth by Coutnry in 2020',
                  x = 'GDP per Capita (USD)',
                  y = 'Life Expectancy at Birth',
                 color = 'Region',
                 size = 'Population')
plot

Creating Interactive Visualizations

plotly::plot_ly(data = data2020,
                x = ~gdp_per_capita,
                y = ~life_expectancy,
                color = ~region_name,
                text = ~country_name,
                size = ~population,
                type = 'scatter',
                mode = 'markers',
                sizes = c(5, 50),
                marker = list(symbol = 'circle',
                              sizemode = 'diameter')) %>%
    plotly::layout(xaxis = list(type = 'log'))
plotly::ggplotly(plot)

Creating Animatied Visualizations

animation_data <- data %>%
    tidyr::drop_na() %>%
    dplyr::group_by(country_code) %>%
    dplyr::filter(n() == 61)

animation_data
animated_plot <- ggplot2::ggplot(data = animation_data,
                                 mapping = ggplot2::aes(x = gdp_per_capita,
                                                        y = life_expectancy)) +
    ggplot2::geom_point(mapping = ggplot2::aes(color = region_name,
                                               size = population)) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_size(range = c(1,10)) +
    ggplot2::geom_smooth(formula = y ~ x,
                         method = 'lm',
                         se = FALSE,
                         color = 'black',
                         linetype = 'dashed') +
    ggplot2::labs(title = 'Health and Wealth by Coutnry in {frame_time}',
                  x = 'GDP per Capita (USD)',
                  y = 'Life Expectancy at Birth',
                  color = 'Region',
                  size = 'Population') +
    gganimate::transition_time(year) +
    gganimate::ease_aes('linear')

animation <- gganimate::animate(plot = animated_plot,
                                renderer = gganimate::gifski_renderer(),
                                duration = 15,
                                width = 5, 
                                height = 5,
                                units = 'in',
                                res = 72)
gganimate::anim_save(filename = 'animation.gif',
                     animation = animation)
knitr::include_graphics('animation.gif')